A Method for Compiling High-Level Language Programs to a Reconfigurable Data-flow Processor
1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations.

2 Compilation Flow

This section briefly describes the phases of the compilation method.

2.1 Frontend 

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The frontend also performs
well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. The SUIF
compiler [2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation

Next, the program is mapped to a control/dataflow graph (CDFG) consisting of connected RDFP
functions. This phase is the main subject of this document and is presented in Section 4.

2.3 Configuration Code Generation

Finally, the last phase directly translates the CDFG to configuration code used to program the RDFP. For
PACT XPP™ Cores, the configuration code is generated as an NML (Native Mapping Language) file.

3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of an RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
an RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2. Event packets are called 1-events or 0-events, depending on their bit-value.

3.1 Configurable Objects and Functions

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects into a "macro" to
realize certain functions. Several RAM objects can, e.g., be combined to realize a RAM function with
larger storage.

Figure 1: Functions of an RDFP

The following functions for processing data and event packets can be configured into an RDFP. See Fig. 1
for a graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

¹ Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded
from the supported HLL subset or substituted by "macros" of existing functions.

• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i.e. a NEXT event packet was received after the last data
packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X, and the
event packet to output U.

• MUX: A MUX function has two data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B is discarded. For a 1-event, B is
copied and A discarded.

• MERGE: A MERGE function has two data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-event, B is copied and A is left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

• MDATA: An MDATA function multiplicates (replicates) data packets. It has a data input A, an event
input SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For all subsequent 0-events at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only if another 1-event arrives at SEL, the next data
packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)

² Note that this can be implemented by a MERGE with special properties on XPP™.

Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies 0-events
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their values.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3, ..., producing a packet at output U. The output
is a 1-event if and only if one or more of the input packets are 1-events (logical or). A packet must
be available at all inputs before an output packet is produced.

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.

Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their equivalents in classical
dataflow machines [3, 4].
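
The difference between the strict MUX and the non-strict MERGE is easy to miss, so the following minimal C sketch models the three data-routing functions on single-packet slots. The `Slot` type and the `*_fire` functions are hypothetical names introduced only for this illustration, and the firing condition chosen for MERGE (SEL plus the selected input only) is an assumption derived from the description above, not a statement about any particular hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: an input or output holds at most one packet. */
typedef struct { bool full; int value; } Slot;

static void put(Slot *s, int v) { assert(!s->full); s->full = true; s->value = v; }
static int take(Slot *s) { assert(s->full); s->full = false; return s->value; }

/* MUX: strict. Needs packets at A, B and SEL and consumes all three. */
bool mux_fire(Slot *a, Slot *b, Slot *sel, Slot *x) {
    if (!a->full || !b->full || !sel->full || x->full) return false;
    int va = take(a), vb = take(b);
    put(x, take(sel) ? vb : va);          /* 0-event selects A, 1-event selects B */
    return true;
}

/* MERGE: non-strict (assumption). Only SEL and the selected input are
 * consumed; a packet at the other input stays where it is. */
bool merge_fire(Slot *a, Slot *b, Slot *sel, Slot *x) {
    if (!sel->full || x->full) return false;
    Slot *chosen = sel->value ? b : a;
    if (!chosen->full) return false;
    take(sel);
    put(x, take(chosen));
    return true;
}

/* DEMUX: forwards A to X on a 0-event and to Y on a 1-event. */
bool demux_fire(Slot *a, Slot *sel, Slot *x, Slot *y) {
    if (!a->full || !sel->full) return false;
    Slot *target = sel->value ? y : x;
    if (target->full) return false;       /* stall until the target is free */
    take(sel);
    put(target, take(a));
    return true;
}
```

With packets waiting at both A and B, a 0-event at SEL empties both inputs of a MUX but leaves the packet at B untouched for a MERGE.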



3.2 Packet-based Communication Network 

The communication network of an RDFP can connect an output of one object (i.e. its respective
function) to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions as in a
dataflow machine with acknowledge [5]. I.e., the function only executes when all input packets are
available (apart from the non-strict exceptions as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection"⁴), the self-synchronization is disabled. The
user has to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might
get lost, and the value resulting from combining two or more packets is undefined. However, a function
output can be connected to many function inputs ("1 to N connection") without problems.
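
The firing rule with acknowledge can be illustrated by a minimal C sketch of a two-input ALU. The `Slot` type and the functions `alu_add_fire` and `consume` are hypothetical names for this illustration; the sketch assumes that every input and output buffers exactly one packet.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-packet buffer at a function input or output. */
typedef struct { bool full; int value; } Slot;

/* A strict function executes only when all inputs hold a packet and the
 * previous output packet has been consumed; otherwise it stalls. */
bool alu_add_fire(Slot *a, Slot *b, Slot *x) {
    if (!a->full || !b->full) return false;   /* wait for both operands */
    if (x->full) return false;                /* stall: output not yet consumed */
    x->value = a->value + b->value;
    x->full = true;
    a->full = b->full = false;                /* inputs are consumed */
    return true;
}

/* The consumer takes the packet, which frees the output again. */
int consume(Slot *x) {
    assert(x->full);
    x->full = false;
    return x->value;
}
```

In this model a second call to `alu_add_fire` with fresh operands returns false until `consume` has emptied the output, which is the self-synchronizing behavior described above.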

There are some special cases: 

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback are possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.

³ Note that this function is implemented by the EAND operator on the XPP™.

⁴ Note that on XPP™ Cores, an "N to 1 connection" for events is realized by the EOR function, and for
data by just assigning several outputs to an input.

4 Configuration Generation 

4.1 Language Definition 

The following HLL features are not supported by the method described here: 

• pointer operations 

• library calls, operating system calls (including standard I/O functions)

• recursive function calls (Note that non-recursive function calls can be eliminated by function in- 
lining and therefore are not considered here.) 

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported: 

• INPORTs and OUTPORTs can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

4.2 Mapping of High-Level Language Constructs

This method converts an HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.
Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions. An array x
is mapped to RAM RAM(x). If several arrays are mapped to the same RAM, an offset is assigned, too.
The RAMs are added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.
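
The text does not say how arrays are distributed over the RAMs. As a minimal sketch, the following C function uses a simple first-fit strategy, which is an assumption made for this illustration; `Placement`, `map_arrays` and `MAX_RAMS` are hypothetical names.

```c
#include <assert.h>

enum { MAX_RAMS = 8 };   /* hypothetical device limit for this sketch */

/* Result of the mapping: RAM index and offset within that RAM. */
typedef struct { int ram; int offset; } Placement;

/* First-fit mapping of n arrays (sizes in data words) onto nrams RAM
 * functions of ram_size words each. Returns 0 on success and -1 if
 * there are not enough RAMs of sufficient size. */
int map_arrays(const int *sizes, int n, int ram_size, int nrams, Placement *out) {
    int used[MAX_RAMS] = {0};
    assert(nrams <= MAX_RAMS);
    for (int i = 0; i < n; i++) {
        int r = 0;
        while (r < nrams && used[r] + sizes[i] > ram_size) r++;
        if (r == nrams) return -1;            /* array i fits nowhere */
        out[i].ram = r;
        out[i].offset = used[r];              /* offset for arrays sharing a RAM */
        used[r] += sizes[i];
    }
    return 0;
}
```

The offset stored here is the value that later has to be added to every address computed for an access to that array.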

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program
statement by statement and descends into the loops and conditional statements as appropriate. The following
two pieces of information are updated at every program point⁵ during the traversal:

• START points to an event output of an RDFP function. This output delivers a 0-event whenever
the program execution reaches this program point. At the beginning, a 0-CONSTANT preloaded
with an event input is added to the CDFG. (It delivers a 0-event immediately after configuration.)
START initially points to its output. This event is used to start the overall program execution. The
START_new signal generated after a program part has finished executing is used as new START
signal for the following program part, or it signals termination of the entire program. The START
events guarantee that the execution order of the original program is maintained wherever the data
dependencies alone are not sufficient. This scheduling scheme is similar to a one-hot controller
for digital hardware.

• VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

⁵ In a program, program points are between two statements or before the beginning or after the end of a
program component like a loop or a conditional statement.

The following subsections systematically list all HLL program components and describe how they are
processed, thereby altering the CDFG, START and VARLIST.

4.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer:⁷

    a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:

    b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in the CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created.⁸ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether directly from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

⁶ This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].

⁷ Note that we use C syntax for the following examples.
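
As a minimal sketch of this bookkeeping, the following C code builds VARLIST and the ALUs for the two assignments of this section, read here as `a = 5;` followed by `b = a * 2 + 3;`. All type and function names (`Output`, `Cdfg`, `vardef`, `assign`, `alu`, `build_example`) are hypothetical and only serve the illustration; ALU outputs are identified by a running index instead of Roman numbers.

```c
#include <assert.h>
#include <string.h>

enum { MAX_PAIRS = 16, MAX_ALUS = 16 };

/* A function output: a constant pseudo-object or the output of an ALU. */
typedef struct { int is_const; int value; /* constant, or ALU index */ } Output;
typedef struct { const char *var; Output out; } Pair;
typedef struct { char op; Output a, b; } Alu;

typedef struct {
    Pair pairs[MAX_PAIRS]; int npairs;   /* pairs[npairs-1] is the front of VARLIST */
    Alu alus[MAX_ALUS];    int nalus;
} Cdfg;

Output constant(int v) { Output o = {1, v}; return o; }

/* VARDEF(var): output of the first pair for var, searching from the front. */
Output vardef(const Cdfg *g, const char *var) {
    for (int i = g->npairs - 1; i >= 0; i--)
        if (strcmp(g->pairs[i].var, var) == 0) return g->pairs[i].out;
    assert(!"variable used before definition");
    return constant(0);
}

/* Add the pair {var, rhs} to the front of VARLIST. */
void assign(Cdfg *g, const char *var, Output rhs) {
    assert(g->npairs < MAX_PAIRS);
    g->pairs[g->npairs].var = var;
    g->pairs[g->npairs].out = rhs;
    g->npairs++;
}

/* Allocate one ALU for an operator node and return its output. */
Output alu(Cdfg *g, char op, Output a, Output b) {
    assert(g->nalus < MAX_ALUS);
    Alu n = {op, a, b};
    g->alus[g->nalus] = n;
    Output o = {0, g->nalus++};
    return o;
}

/* Process "a = 5; b = a * 2 + 3;" and return the number of ALUs created. */
int build_example(Cdfg *g) {
    g->npairs = g->nalus = 0;
    assign(g, "a", constant(5));                       /* CDFG unchanged */
    Output mul = alu(g, '*', vardef(g, "a"), constant(2));
    Output add = alu(g, '+', mul, constant(3));
    assign(g, "b", add);
    return g->nalus;
}
```

Because VARDEF(a) is the constant 5, the multiplier's first input ends up connected to a constant, which matches the remark that such inputs must be defined as constant.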

4.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.⁹

Consider the following example: 
    i = 7;
    a = 3;
    if (i < 10) {
        a = 5;
        c = 7;
    }
    else {
        c = a - 1;
        d = a;
    }

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 3}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].

Note that case- or switch-statements can be processed, too, since they can (without loss of generality)
be converted to nested if-then-else statements.

Processing conditional statements this way does not require explicit control and does not change START.
Both branches are executed in parallel and synchronized by the data-flow. It is possible to pipeline the
dataflow for optimal throughput.
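
The combination phase corresponds to what is often called if-conversion: both branches are computed unconditionally and one MUX per variable picks the valid result. The minimal C sketch below replays the example of this section under two assumptions: the condition is read as `i < 10`, and the else-branch value is wired to MUX input A and the then-branch value to input B. `mux`, `Vars`, `UNDEFINED` and `conditional_example` are hypothetical names.

```c
#include <assert.h>

/* MUX as defined in Section 3.1: a 0-event selects A, a 1-event selects B. */
static int mux(int sel, int a, int b) { return sel ? b : a; }

typedef struct { int a, c, d; } Vars;

enum { UNDEFINED = 0 };   /* stands in for the special "undefined" constant */

/* Evaluate the example with both branches computed in parallel. */
Vars conditional_example(int i) {
    int a = 3;
    int cond = (i < 10);             /* event output of the comparator */

    int a_then = 5, c_then = 7;      /* then branch (VARLIST1) */
    int c_else = a - 1, d_else = a;  /* else branch (VARLIST2) */

    Vars r;
    r.a = mux(cond, a, a_then);          /* else side: original VARLIST entry */
    r.c = mux(cond, c_else, c_then);
    r.d = mux(cond, d_else, UNDEFINED);  /* d is not defined in the then branch */
    return r;
}
```

For i = 7 the condition holds and the then-branch values are selected; in that case d carries the placeholder value, which a functionally correct program never reads.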

⁸ Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the
compiler frontend would normally have substituted the second assignment by b = 13 (constant propagation).
For the simplicity of this explanation, no frontend optimizations are considered in this and the following
examples.

⁹ Definition: A variable is live at a program point if its value is read at a statement reachable from here
without intermediate redefinition.

4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.¹⁰
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The
DEMUX functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then-branch is processed with
VARLIST1, and the else-branch with VARLIST2. Finally, the output values are combined. OUT
contains the new values for the same variables as in IN. Since only one branch is ever activated, there will not
be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the output
must be inserted, too. They are controlled by the condition like the DEMUX functions.
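
The set IN, read here as the union of the use and def sets of both branches and the use set of the header, can be computed with simple bit sets. The following C sketch assumes that each variable is identified by one bit; `VarSet`, `Region`, `in_set` and `count_vars` are hypothetical names for this illustration.

```c
#include <assert.h>

/* One bit per program variable (hypothetical numbering). */
typedef unsigned VarSet;
enum { VAR_I = 1 << 0, VAR_A = 1 << 1, VAR_C = 1 << 2, VAR_D = 1 << 3 };

/* use and def sets of a program region, as delivered by dataflow analysis. */
typedef struct { VarSet use, def; } Region;

/* IN = use(thenbody) U def(thenbody) U use(elsebody) U def(elsebody) U use(header) */
VarSet in_set(Region thenbody, Region elsebody, Region header) {
    return thenbody.use | thenbody.def | elsebody.use | elsebody.def | header.use;
}

/* Number of variables in a set, i.e. the number of DEMUX functions needed. */
int count_vars(VarSet s) {
    int n = 0;
    while (s) { n += (int)(s & 1u); s >>= 1; }
    return n;
}
```

For a then-branch defining a and c, an else-branch using a and defining c and d, and a header using i, the set IN contains four variables, so four DEMUX functions would be inserted.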

The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the
execution as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT¹¹ or
through a 0-FILTER, respectively. The overall START_new output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's START_new outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.

4.2.4 WHILE Loops

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3,
double line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the whilebody are fed back to whilebody's input via the MERGE and DEMUX
operators as long as the condition is true. Finally, after the last iteration, they are forwarded to OUT. The
outputs are added to the new VARLIST.¹²

Two extensions with respect to [4] are added (dotted lines in Fig. 9):

• In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This allows to control the time
of the start of the loop execution and to restart it.

• The whilebody's START input is connected to the header output, sent through a
1-FILTER/0-CONSTANT combination as above (generates a 0-event for each loop iteration). By
ECOMB-combining whilebody's START_new output with the header output for the MERGE functions'
SEL inputs, the next loop iteration is only started after the previous one has finished. The while
loop's START_new output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within
whilebody.

¹⁰ A variable is used in a statement (and hence in a program region containing this statement) if its value
is read. A variable is defined in a statement (or region) if a new value is assigned to it.

¹¹ The 0-CONSTANT is required since START events must always be 0-events.

¹² Note that the MERGE function for variables not live at the loop's beginning and the whilebody's beginning
can be removed since its output is not used. For these variables, only the DEMUX function to output the final
value is required. Also note that the MERGE functions can be replaced by simple "2 to 1 connections" if the
configuration process guarantees that packets from IN1 always arrive at the DEMUX's input before feedback
values arrive.

4.2.5 FOR Loops

FOR loops are particularly regular WHILE loops. Therefore we could handle them as explained above.
However, our RDFP features the special counter function CNT and the data packet multiplication
function MDATA which can be used for a more efficient implementation of FOR loops. This new FOR loop
scheme is shown in Fig. 10.

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any other expressions (see Sections 4.2.1 and [...]) and connected
to the respective inputs.

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables in IN1 =
def(forbody), i.e. those defined in forbody.¹³ IN1 does not contain variables which are only used
in forbody, LB, UB, or INC, and also does not contain the loop index variable. Variables in IN1 are
processed as in WHILE loops, but the MERGE and DEMUX functions' SEL input is connected to
CNT's W output. (The W output does the inverse of a WHILE loop's header output; it outputs a
1-event after the counter has terminated. Therefore the inputs of the MERGE functions and the outputs
of the DEMUX functions are swapped here, and the MERGE functions' SEL inputs are preloaded with
1-events.)

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, it is selected with a DEMUX function controlled by CNT's U event output
(which produces one event for every loop iteration).
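
To make the event outputs of CNT concrete, the following C sketch records what a counter emits for one complete run, following the description in Section 3.1 (N-1 0-events and one 1-event at U, N 0-events at V, N 0-events followed by one 1-event at W). Treating UB as an exclusive bound is an assumption of this sketch, as are the names `CntTrace` and `cnt_run`.

```c
#include <assert.h>

enum { MAX_EVENTS = 64 };

/* Everything a CNT function emits during one run (hypothetical trace type). */
typedef struct {
    int x[MAX_EVENTS]; int nx;   /* data output X: index values */
    int u[MAX_EVENTS]; int nu;   /* event output U */
    int v[MAX_EVENTS]; int nv;   /* event output V */
    int w[MAX_EVENTS]; int nw;   /* event output W */
} CntTrace;

/* Run the counter from lb up to (but excluding) ub in steps of inc. */
void cnt_run(int lb, int ub, int inc, CntTrace *t) {
    assert(inc > 0 && lb < ub);          /* at least one iteration */
    t->nx = t->nu = t->nv = t->nw = 0;
    for (int x = lb; x < ub; x += inc) {
        assert(t->nx < MAX_EVENTS - 1);
        int last = (x + inc >= ub);      /* is this the final index value? */
        t->x[t->nx++] = x;
        t->u[t->nu++] = last;            /* 1-event only with the last value */
        t->v[t->nv++] = 0;               /* one 0-event per iteration */
        t->w[t->nw++] = 0;
    }
    t->w[t->nw++] = 1;                   /* 1-event after termination */
}
```

In this model the 1-event at U coincides with the last index value, while the 1-event at W appears one event later, after the counter has terminated.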

Variables in IN2 = use(forbody) \ def(forbody), i.e. those defined outside the loop and only used
(but not redefined) inside the loop, are handled differently. Unless it is a constant value, the variable's
input value (from VARLIST) must be reproduced in each loop iteration since it is consumed in each
iteration. Otherwise the loop would stall from the second iteration onwards. The packets are reproduced
by MDATA functions, with the SEL inputs connected to CNT's U output. The SEL inputs must be
preloaded with a 1-event to select the first input. The 1-event provided by the last iteration selects a new
value for the next execution of the entire loop.
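
The interplay of the preloaded 1-event and CNT's U output can be traced with a small C model of MDATA. The function `mdata_run` is a hypothetical helper that processes a whole sequence of SEL events at once; it is a sketch of the semantics described in Section 3.1, not of an actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Feed the SEL events sel[0..nsel-1] and the data packets a[0..na-1] into
 * an MDATA function. The produced packets are written to x (which must
 * have room for nsel entries); the return value is their number.
 * Processing stops where the real function would stall. */
size_t mdata_run(const int *sel, size_t nsel, const int *a, size_t na, int *x) {
    size_t next = 0, nout = 0;
    int held = 0;
    bool have = false;
    for (size_t k = 0; k < nsel; k++) {
        if (sel[k] == 1) {
            if (next == na) break;     /* stall: no new data packet at A */
            held = a[next++];          /* consume and remember the packet */
            have = true;
        } else if (!have) {
            break;                     /* 0-event before any packet was taken */
        }
        x[nout++] = held;              /* copy the held packet to output X */
    }
    return nout;
}
```

For a loop with three iterations that is executed twice, SEL sees the preloaded 1-event followed by 0, 0, 1 for each execution. With two packets at A, the first packet is output three times, then the second one three times, and the final 1-event waits for a third packet.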

¹³ Note that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the
configuration process guarantees that packets from IN1 always arrive at the DEMUX's input before feedback
values arrive.




The following control events (dotted lines in Fig. 10) are similar to the WHILE loop extensions, but
simpler. CNT's START input is connected to the loop's overall START signal. START_new is generated
from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output produces one
0-event for each loop iteration and is therefore used as forbody's START. Finally, CNT's NEXT input is
connected to forbody's START_new output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
START_new output since the counter terminates before the last iteration's forbody finishes. Instead,
START_new is generated from CNT's U output ECOMB-combined with forbody's START_new output,
sent through a 1-FILTER/0-CONSTANT combination. The ECOMB produces an event after termination
of each loop iteration, but only the last event is a 1-event because only the last output of CNT's U output
is a 1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR loop
example compilation with and without pipelining.

As for WHILE loops, these methods allow arbitrarily nested loops and conditional statements to be processed.
The following advantages over WHILE loop implementations are achieved:

• One index variable value is generated by the CNT function each clock cycle. This is faster and
smaller than the WHILE loop implementation which allocates a MERGE/DEMUX/ADD loop and
a comparator for the counter functionality.

• Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions and need
not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE loop
implementation.

4.2.6 Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unduly sequentialized by the START signals. In some cases,
innermost loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined dataflow
through the operators of the loop body. The Pipeline Vectorization technique [6] can be easily applied to
the compilation method presented here. As mentioned above, for FOR loops, the CNT's NEXT input is
removed so that CNT counts continuously, thereby overlapping the loop iterations.

All loops without array accesses can be pipelined since the dataflow automatically synchronizes
loop-carried dependences, i.e. dependences between a statement in one iteration and another statement in a
subsequent iteration. Loops with array accesses can be pipelined if the array (i.e. RAM) accesses do
not cause loop-carried dependences or can be transformed to such a form. In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom is exploited in the RAM access technique described below.
Especially for dual-ported RAM it leads to considerable performance improvements.
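
For the simplest kind of loop body the stated condition can be checked arithmetically. The C sketch below considers a loop of the form `a[i + write_off] = f(a[i + read_off])` for i = 0 .. trip_count-1 and tests only whether an address written in one iteration is read in a subsequent one. The restriction to unit-stride indices and the name `written_then_read_later` are assumptions of this illustration; a real compiler needs a general dependence analysis.

```c
#include <assert.h>
#include <stdbool.h>

/* Loop body: a[i + write_off] = f(a[i + read_off]), i = 0 .. trip_count-1.
 * Iteration i writes address i + write_off, and iteration j reads address
 * j + read_off. Both addresses coincide for j = i + (write_off - read_off). */
bool written_then_read_later(int write_off, int read_off, int trip_count) {
    assert(trip_count >= 0);
    int distance = write_off - read_off;   /* iterations between write and read */
    return distance > 0 && distance < trip_count;
}
```

Under these assumptions, `a[i] = a[i-1] + 1` has distance 1 and therefore a loop-carried dependence through the RAM, while `a[i] = a[i] + 1` has distance 0 and does not.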

4.2.7 Array Accesses

In contrast to scalar variables, array accesses have to be controlled explicitly in order to maintain the
program's correct execution order. As opposed to normal dataflow machine models [3], an RDFP does
not have a single address space. Instead, the arrays are allocated to several RAMs. This leads to a
different approach to handling RAM accesses and opens up new opportunities for optimization.

To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase
1 uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both functions are labeled with the array the access refers to, and
both have a START event input and a U event output. The events control the access order. In Phase 2 all
accesses to the same RAM are combined and substituted by a single RAM function as shown in Fig. 1.
This involves manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1: Since arrays are allocated to several RAMs, only accesses to the same RAM have to be
synchronized. Accesses to different RAMs can occur concurrently or even out of order. In case of data
dependencies, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. This is achieved by maintaining separate
START signals for every RAM or even separate START signals for RAM read and RAM write accesses
in pipelined loops. At the end of a basic block*, all START_new outputs must be combined by an
ECOMB to provide a START signal for the next basic block which guarantees that all array accesses in
the previous basic block are completed. For pipelined loops, this condition can even be relaxed: only
after the loop exit do all accesses have to be completed. The individual loop iterations need not be
synchronized.
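
The role of the ECOMB at the end of a basic block can be illustrated with a small software model. This is a sketch under assumptions: an event is modeled as a presence flag plus a 0/1 value, and the ECOMB is taken to fire only once every input holds an event, emitting the OR of the values (the OR behavior is stated in the Phase 2 description; the struct and function names are hypothetical).

```c
#include <stdbool.h>

/* Event model: 'present' tells whether a packet has arrived, 'value' is 0 or 1. */
typedef struct { bool present; int value; } Event;

/* ECOMB sketch: produces an output only when all n inputs hold an event;
   the output value is the OR of the input values. */
Event ecomb(const Event *in, int n) {
    Event out = { true, 0 };
    for (int i = 0; i < n; i++) {
        if (!in[i].present) {          /* some RAM access is still pending */
            out.present = false;
            out.value = 0;
            return out;
        }
        out.value |= in[i].value;
    }
    return out;
}
```

Feeding the per-RAM START_new events into `ecomb` yields the START of the next basic block only after the last outstanding access has completed.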

First the RAM addresses are computed. The compiler frontend's standard transformation for array
accesses can be used, and a CDFG function's output is generated which provides the address. If applicable,
the offset with respect to the RDFP RAM (as determined in the initial mapping phase) must be added.
This output is connected to the pseudo RAM read's RD input (for a read access) or to the pseudo RAM
write's WR input (for a write access). Additionally, the OUT output (read) or IN input (write) is
connected. The START input is connected to the variable's START signal, and the U output is used as
START_new for the next access.
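
The address computation is ordinary array linearization plus the RAM offset. The sketch below assumes row-major layout as in C and word addressing within the RAM; the function and parameter names are illustrative and not taken from the compiler.

```c
/* ram_offset: start of the array within its RDFP RAM, as fixed by the
   initial mapping phase (assumed to be a word offset). */
int ram_addr_1d(int ram_offset, int i) {
    return ram_offset + i;
}

/* Row-major linearization for a two-dimensional array with ncols columns. */
int ram_addr_2d(int ram_offset, int ncols, int i, int j) {
    return ram_offset + i * ncols + j;
}
```

In the CDFG these expressions become adder and multiplier functions whose final output feeds the RD or WR input.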

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
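
The bookkeeping for registered reads might look as follows. This is a simplified sketch, not the compiler's VARLIST implementation: it assumes a single index variable per access, stores names as string pointers, and marks entries invalid instead of physically removing them. All identifiers are hypothetical.

```c
#include <string.h>

#define MAX_READS 16

typedef struct {
    const char *array;      /* e.g. "a" */
    const char *index_var;  /* e.g. "i"; one index variable is assumed */
    int node;               /* CDFG output that already holds the value read */
    int valid;
} ReadEntry;

typedef struct { ReadEntry e[MAX_READS]; int n; } ReadList;

/* Returns the node of an earlier identical read, or -1 if a RAM read is needed. */
int lookup_read(const ReadList *l, const char *array, const char *idx) {
    for (int k = 0; k < l->n; k++)
        if (l->e[k].valid && strcmp(l->e[k].array, array) == 0 &&
            strcmp(l->e[k].index_var, idx) == 0)
            return l->e[k].node;
    return -1;
}

/* Registers a read; in this sketch the entry is dropped when the list is full. */
void register_read(ReadList *l, const char *array, const char *idx, int node) {
    if (l->n < MAX_READS)
        l->e[l->n++] = (ReadEntry){ array, idx, node, 1 };
}

/* An assignment to 'var' invalidates every registered read indexed by it. */
void invalidate_index(ReadList *l, const char *var) {
    for (int k = 0; k < l->n; k++)
        if (strcmp(l->e[k].index_var, var) == 0)
            l->e[k].valid = 0;
}
```

A second occurrence of `a[i]` then reuses the registered output as long as `i` has not been assigned in between.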

The following example with two read accesses compiles to the intermediate CDFG shown in Fig. 12. The
START signals refer only to variable a. STOP1 is the event connection which synchronizes the accesses.
Inputs START(old), i and j should be substituted by the actual outputs resulting from the program before
the array reads.



    x = a[i];
    y = a[j];
    z = x + y;



Fig. 13 shows the translation of the following write access: 



    a[i] = x;



*A basic block is a program part with a single entry and a single exit point, i.e. a piece of straight-line code.

Phase 2: We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access), GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x; from Fig. 13 is transformed to the final CDFG shown in Fig. 5.

However, if several read or several write accesses (i.e. pseudo-functions from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a START_new event of
the original pseudo-function should only be generated for the respective program point, i.e. for the
current access. This is achieved by connecting the START signals of all other accesses (pseudo-functions)
of the same type (read or write) with the inverted START signal of the current access. The resulting
signal produces an event for every access, but only for the current access a 1-event. This event is
ECOMB-combined with the RAM's ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access
produces a 1-event. Next, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT, resulting
in a START_new signal which produces a 0-event only after the current access is completed, as required.
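
The event chain described above can be traced with a small model of one pseudo-function's control path. It is a sketch under stated assumptions: START events and the RAM's ERD/EWR events are modeled as 0-events, and the three helper functions are hypothetical stand-ins for the n-to-1 connection, the ECOMB, and the 1-FILTER/0-CONSTANT pair.

```c
#include <stdbool.h>

/* n-to-1 connection of the inverted START of the current access with the
   plain STARTs of the other accesses: 1 for the current access, else 0. */
int select_event(bool current_is_active) {
    int start = 0;                      /* a START event carries the value 0 */
    return current_is_active ? !start : start;
}

/* ECOMB with the RAM's ERD or EWR event (assumed to be a 0-event). */
int ecomb2(int sel, int erd_or_ewr) {
    return sel | erd_or_ewr;
}

/* 1-FILTER followed by 0-CONSTANT. Returns true if START_new fires and
   then stores the emitted value (always 0) in *out. */
bool filter1_const0(int ev, int *out) {
    if (ev != 1)
        return false;                   /* the 1-FILTER drops 0-events */
    *out = 0;                           /* the 0-CONSTANT replaces the value */
    return true;
}
```

Only the chain belonging to the access that was actually started lets an event through, so each program point sees its own START_new and nothing else.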

For several accesses, several sources are connected to the RD, WR and IN inputs of a RAM. This disables
the self-synchronization. However, since only one access occurs at a time, the GATEs only allow one
data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets:
They occur for every read access, but must only be used (and forwarded to subsequent operators) for
the current access. This can be achieved by connecting the OUT output via a DEMUX function. The Y
output of the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which
only forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a
0-event. The signal created by the ECOMB described above for the START_new signal creates a 1-event
for the current access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired
functionality.
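
The selective-gate use of the DEMUX can be sketched in a few lines. The model below is illustrative only: `demux_y` and `collect` are hypothetical names, and the OUT packets of consecutive reads are represented as a plain array.

```c
#include <stdbool.h>

/* DEMUX with only the Y output connected: forwards the data packet when SEL
   is a 1-event and discards it when SEL is a 0-event. */
bool demux_y(int sel, int data, int *y) {
    if (sel == 1) {
        *y = data;
        return true;
    }
    return false;   /* packet goes to the unconnected X output and is lost */
}

/* Passes the OUT packets of n consecutive reads through the DEMUX that
   belongs to access 'current'; returns how many packets were forwarded. */
int collect(const int *out_packets, int n, int current, int *dst) {
    int forwarded = 0;
    for (int k = 0; k < n; k++)
        forwarded += demux_y(k == current, out_packets[k], dst);
    return forwarded;
}
```

Each reading program point owns one such DEMUX, so every OUT packet is consumed exactly once by the operator it was meant for.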

Fig. 4 shows the resulting CDFG for the first example above (two read accesses), after applying the
transformations of Phase 2 to Fig. 12. STOP1 is now generated as follows: START(old) is inverted,
"2 to 1 connected" to STOP1 (because it is the START input of the second read pseudo-function),
ECOMB-combined with RAM's ERD output and sent through the 1-FILTER/0-CONSTANT combination.
START(new) is generated similarly, but here START(old) is directly used and STOP1 inverted. The
GATEs for input IN (i and j) are connected to START(old) and STOP1, respectively, and the DEMUX
functions for outputs x and y are connected to the ECOMB outputs related to STOP1 and START(new).

Multiple write accesses use the same control events, but instead of one GATE per access for the RD
inputs, one GATE for WR and one GATE for IN (with the same E input) are used. The EWR output is
processed like the ERD output for read accesses.

This transformation ensures that all RAM accesses are executed correctly, but it is not very fast since read
or write accesses to the same RAM are not pipelined. The next access only starts after the previous one
is completed, even if the RAM being used has several pipeline stages. This inefficiency can be removed
as follows:

First, continuous sequences of either read accesses or write accesses (not mixed) within a basic block are
detected by checking for pseudo-functions whose U output is directly connected to the START input of
another pseudo-function of the same RAM and the same type (read or write). For these sequences it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are
only forwarded in the desired order. Then only the first access in the sequence needs to be controlled by
a GATE or GATEs. Similarly, the OUT outputs of a read access can be distributed more efficiently for
a sequence. A combination of DEMUX functions with the same ESEQ control can be used. It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.
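
Sequence detection and the benefit of a balanced tree can be sketched as follows. The representation is an assumption made for illustration: accesses are stored in an array, `u_to` holds the index of the pseudo-function fed by the U output (or -1), MERGE and DEMUX functions are taken to have two data ports, and the chains are acyclic because they lie within one basic block.

```c
typedef enum { RD_ACC, WR_ACC } Kind;

typedef struct {
    int ram;    /* id of the RAM the access is allocated to */
    Kind kind;  /* read or write */
    int u_to;   /* index of the pseudo-function whose START is fed by U, or -1 */
} Acc;

/* Length of the streamable sequence that starts at access 'first':
   follow U -> START links while RAM and access type stay the same. */
int sequence_length(const Acc *a, int first) {
    int len = 1, cur = first;
    while (a[cur].u_to >= 0 &&
           a[a[cur].u_to].ram == a[cur].ram &&
           a[a[cur].u_to].kind == a[cur].kind) {
        cur = a[cur].u_to;
        len++;
    }
    return len;
}

/* Depth of a balanced binary tree of 2-input MERGEs (or 2-output DEMUXes)
   serving n accesses; a linear chain would have depth n - 1 instead. */
int tree_depth(int n) {
    int d = 0;
    for (int span = 1; span < n; span *= 2)
        d++;
    return d;
}
```

For a sequence of eight reads, the balanced arrangement passes each address through three MERGE stages, whereas a chain would need up to seven.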

The START_new signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly "n to 1 connected" with the other accesses' START signal
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences), ECOMB-
connected to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination,
similar to the basic method described above. Since only the last ESEQ output is a 1-event, only the
last RAM access generates a START_new as required. Alternatively, for read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing a
START_new event.
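
The effect of the ESEQ[00..1] pattern can be checked with a short model. It is a sketch with hypothetical function names; the ERD/EWR event that is OR-combined with each ESEQ value is assumed to be a 0-event.

```c
/* Fills ev[0..n-1] with the values an ESEQ[00..1] emits for one START event:
   n events in total, of which only the last one is a 1-event. */
void eseq_last_one(int *ev, int n) {
    for (int k = 0; k < n; k++)
        ev[k] = (k == n - 1);
}

/* Counts the START_new events of a sequence: each ESEQ value is OR-combined
   with the access's ERD/EWR event (assumed 0) and then passes the 1-FILTER.
   *last_index receives the position of the access that produced the event. */
int start_new_count(const int *ev, int n, int *last_index) {
    int count = 0;
    for (int k = 0; k < n; k++) {
        int combined = ev[k] | 0;       /* ECOMB with a 0-event */
        if (combined == 1) {            /* 1-FILTER */
            count++;
            *last_index = k;
        }
    }
    return count;
}
```

For a sequence of three accesses, exactly one START_new results, and it belongs to the third access.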

Fig. 14 shows the optimized version of the first example (Figures 12 and 4) using the ESEQ-method for
generating START_new, and Fig. 6 shows the final CDFG of the following, larger example with three
array reads. Here the latter method for producing the START_new event is used.

    x = a[i];
    y = a[j];
    z = a[k];

If several read sequences or read sequences and single read accesses occur for the same RAM, 1-events
for detecting the current accesses must be generated for sequences of read accesses. They are needed
to separate the OUT-values relating to separate sequences. The ESEQ output just defined, sent through
a 1-CONSTANT, achieves this. It is again "n to 1 connected" to the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences). The resulting
event is used to control a first-stage DEMUX which is inserted to select the relevant OUT output data
packets of the sequence as described above for the basic method. Refer to the second example (Figures
15 and 16) in Section 4.3 for a complete example.

4.2.8 Input and Output Ports

Input and output ports are processed similar to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.

4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop.

    a = b + c;
    for (i=0; i<=10; i++) {
        a = a + i;
        x[i] = k;
    }

In this example, IN1 = {a} and IN2 = {k} (cf. Fig. 10). The MERGE function for variable a is
replaced by a 2:1 data connection as mentioned in the footnote of Section 4.2.5. Note that only one
data packet arrives for variables b, c and k, and one final packet is produced for a (out). The forbody does
not use a START event since both operations (the adder and the RAM write) are dataflow-controlled
by the counter anyway. But the RAM's EWR output is the forbody's START_new, and connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.6, was not applied here. If it
is applied (which is possible for this loop), CNT's NEXT input is not connected, cf. Fig. 1[?]. Here, the
loop iterations overlap. START_new is generated from CNT's U output and forbody's START_new (i.e.
RAM's EWR output), as defined at the end of Section 4.2.5.

The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x
and a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y
occurs.



    z = 0;
    for (i=0; i<=10; i++) {
        z = z + y[i] + y[2*i];
    }
    a = y[k];

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y are used. The reentry to the forbody is also controlled by two
independent signals ("cycle1" and "cycle2"). For the read accesses, "cycle2" guarantees that the read y
accesses occur in the correct order. But the beginning of an iteration for read y and write x accesses is
not synchronized. Only at loop exit must all accesses be finished, which is guaranteed by signal "loop
finished". The single read access is completely independent of the loop.

Fig. 16 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a single write access
needs no additional control, and "cycle2" is removed since the inserted MERGE and DEMUX functions
automatically guarantee the correct execution order. The read y accesses are not independent anymore
since they all refer to the same RAM, and the functions have been merged. ESEQs have been allocated
to control the MERGE and DEMUX functions of the read sequence, and for the first-stage DEMUX
functions which separate the read OUT values for the read sequence and for the final single read access.
The ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described in Section
4.2.7, Phase 2, to generate correct control events for the GATEs and DEMUX functions.